This project is a linear regression analysis of housing prices, based on a publicly available data set on housing prices in the city of Ames, Iowa, that was curated by Dean De Cock of Truman state university.
For this project, our goal will be to attempt to find, using a variety of statistical modelling techniques in R, a “good” (below) linear regression model that both performs well and is easily interpretable for explanatory purposes.
A “good model” should:
The data set consists of 1460 observations with one numerical response (house price) and 80 predictors. The predictors are a relatively even mix of numerical and categorical variables.
The predictors are summarized in the following table:
| Variable name | Description | Factor Levels |
|---|---|---|
| SalePrice | the property’s sale price in dollars. This is the target variable that you’re trying to predict. | |
| mssubclass | the building class | |
| MSZoning | The general zoning classification | C (all), FV, RH, RL, RM |
| LotFrontage | Linear feet of street connected to property | |
| LotArea | Lot size in square feet | |
| Street | Type of road access | Grvl, Pave |
| Alley | Type of alley access | Grvl, Pave |
| LotShape | General shape of property | IR1, IR2, IR3, Reg |
| LandContour | Flatness of the property | Bnk, HLS, Low, Lvl |
| Utilities | Type of utilities available | AllPub, NoSeWa |
| LotConfig | Lot configuration | Corner, CulDSac, FR2, FR3, Inside |
| LandSlope | Slope of property | Gtl, Mod, Sev |
| Neighborhood | Physical locations within Ames city limits | Blmngtn, Blueste, … |
| Condition1 | Proximity to main road or railroad | Artery, Feedr, Norm, PosA, … |
| Condition2 | Proximity to main road or railroad (if a second is present) | Artery, Feedr, … |
| BldgType | Type of dwelling | 1Fam, 2fmCon, Duplex, Twnhs, TwnhsE |
| HouseStyle | Style of dwelling | 1.5Fin, 1.5Unf, 1Story, 2.5Fin, 2.5Unf, 2Story, SFoyer, SLvl |
| OverallQual | Overall material and finish quality | |
| OverallCond | Overall condition rating | |
| YearBuilt | Original construction date | |
| YearRemodAdd | Remodel date | |
| RoofStyle | Type of roof | Flat, Gable, Gambrel, Hip, Mansard, Shed |
| RoofMatl | Roof material | ClyTile, CompShg, Membran, Metal, Roll, Tar&Grv, WdShake, WdShngl |
| Exterior1st | Exterior covering on house | AsbShng, AsphShn, BrkComm, BrkFace, CBlock, … |
| Exterior2nd | Exterior covering on house (if more than one material) | AsbShng, AsphShn, … |
| MasVnrType | Masonry veneer type | BrkCmn, BrkFace, None, Stone |
| MasVnrArea | Masonry veneer area in square feet | |
| ExterQual | Exterior material quality | Ex, Fa, Gd, TA |
| ExterCond | Present condition of the material on the exterior | Ex, Fa, Gd, Po, TA |
| Foundation | Type of foundation | BrkTil, CBlock, PConc, Slab, Stone, Wood |
| BsmtQual | Height of the basement | Ex, Fa, Gd, TA |
| BsmtCond | General condition of the basement | Fa, Gd, Po, TA |
| BsmtExposure | Walkout or garden level basement walls | Av, Gd, Mn, No |
| BsmtFinType1 | Quality of basement finished area | ALQ, BLQ, GLQ, LwQ, Rec, Unf |
| BsmtFinSF1 | Type 1 finished square feet | |
| BsmtFinType2 | Quality of second finished area (if present) | ALQ, BLQ, GLQ, LwQ, Rec, Unf |
| BsmtFinSF2 | Type 2 finished square feet | |
| BsmtUnfSF | Unfinished square feet of basement area | |
| TotalBsmtSF | Total square feet of basement area | |
| Heating | Type of heating | Floor, GasA, GasW, Grav, OthW, Wall |
| HeatingQC | Heating quality and condition | Ex, Fa, Gd, Po, TA |
| CentralAir | Central air conditioning | N, Y |
| Electrical | Electrical system | FuseA, FuseF, FuseP, Mix, SBrkr |
| 1stFlrSF | First Floor square feet | |
| 2ndFlrSF | Second floor square feet | |
| LowQualFinSF | Low quality finished square feet (all floors) | |
| GrLivArea | Above grade (ground) living area square feet | |
| BsmtFullBath | Basement full bathrooms | |
| BsmtHalfBath | Basement half bathrooms | |
| FullBath | Full bathrooms above grade | |
| HalfBath | Half baths above grade | |
| Bedroom | Number of bedrooms above basement level | |
| Kitchen | Number of kitchens | |
| KitchenQual | Kitchen quality | Ex, Fa, Gd, TA |
| TotRmsAbvGrd | Total rooms above grade (does not include bathrooms) | |
| Functional | Home functionality rating | Maj1, Maj2, Min1, Min2, Mod, Sev, Typ |
| Fireplaces | Number of fireplaces | |
| FireplaceQu | Fireplace quality | Ex, Fa, Gd, Po, TA |
| GarageType | Garage location | 2Types, Attchd, Basment, BuiltIn, CarPort, Detchd |
| GarageYrBlt | Year garage was built | |
| GarageFinish | Interior finish of the garage | Fin, RFn, Unf |
| GarageCars | Size of garage in car capacity | |
| GarageArea | Size of garage in square feet | |
| GarageQual | Garage quality | Ex, Fa, Gd, Po, TA |
| GarageCond | Garage condition | Ex, Fa, Gd, Po, TA |
| PavedDrive | Paved driveway | N, P, Y |
| WoodDeckSF | Wood deck area in square feet | |
| OpenPorchSF | Open porch area in square feet | |
| EnclosedPorch | Enclosed porch area in square feet | |
| 3SsnPorch | Three season porch area in square feet | |
| ScreenPorch | Screen porch area in square feet | |
| PoolArea | Pool area in square feet | |
| PoolQC | Pool quality | Ex, Fa, Gd |
| Fence | Fence quality | GdPrv, GdWo, MnPrv, MnWw |
| MiscFeature | Miscellaneous feature not covered in other categories | Gar2, Othr, Shed, TenC |
| MiscVal | $Value of miscellaneous feature | |
| MoSold | Month Sold | |
| YrSold | Year Sold | |
| SaleType | Type of sale | COD, Con, ConLD, ConLI, ConLw, CWD, New, Oth, WD |
| SaleCondition | Condition of sale | Abnorml, AdjLand, Alloca, Family, Normal, Partial |
\[ \begin{align*} Hypothesis : \ \ &\text{Using only the characteristics most prominently displayed across typical real estate websites,}\\ &\text{we can build a "good" (as described above) linear model for predicting home prices.} \end{align*} \] Looking at the default view of some of the listings on realtor.com, we can see that the following information is displayed.
Other realtor websites such as zillow.com, redfin.com, etc. also show similar data for each listing. The fact that all of these different websites display the same default data tells us that it is likely that these are the most important aspects of a house for a potential buyer. Under our hypothesis, we assume these are highly correlated with the sale price.
Additionally, the various realtor websites allow one specify the dwelling type, such as condo, house, aparment, etc. This suggests these will be important characteristics as well.
The top options under the website filters include the age of home, the number of stories, and whether it has heating or cooling.
Taking all of this into account, we will build a model using the information most prominently featured on realtor websites. The initial numeric variables that we will use in this approach are:
We consider a small subset of factor variables that seem the most likely to determine price, these include type of building (BldgType), the style of the home (HouseStyle), and the inclusion of heating and central AC.
Also, note that two of the variables in the data are OverallQual and OverallCond. It’s not too much of a stretch to assume that the overall quality of the construction of the house, and the condition of the house would also factor into the asking price for the home. But the way this information is displayed on the website is not with a number, but typically with a paragraph or two, such as:
We will be using the presence of descriptive paragraphs such as these to support my ability to include these two variables under my hypothesis. Essentially we will be assuming that the score given by OverallCond and OverallQual is more or less an accurate heuristic for converting these sorts of descriptions into a number.
library(tidyverse)
library(lmtest)
library(faraway)
library(MASS)
library(corpcor)
library(corrplot)
library(ggcorrplot)
library(GGally)
library(leaps)
library(knitr)
library(kableExtra)
library(caret)
library(outliers)
library(ggplot2)
library(ggthemes)
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue"){
par(mfrow = c(1,2))
plot(fitted(model), resid(model), col = pcol, pch = 20,
xlab = "Fitted", ylab = "Residuals", main = paste("Fitted Vs Residuals"))
abline(h = 0, col = lcol, lwd = 2)
qqnorm(resid(model), col = pcol)
qqline(resid(model), lty = 2, lwd = 2, col = lcol)
}
mape = function(actual, pred){
mean(abs((actual - pred) / actual)) * 100
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
kfold_cv = function(data, model, k, log_response = FALSE) {
set.seed(600)
# Get indices for k folds (the data to be held out)
holdouts = split(sample(1:nrow(data)), 1:k)
total_mape = 0
# Now for each holdout fit a model to the data not held out, and test against held out data
for (holdout in holdouts) {
train = data[-c(holdout),]
test = data[c(holdout),]
m = lm(formula(model), data = train)
pred = predict(m, newdata = test)
if (log_response) {
total_mape = total_mape + mape(test$SalePrice, exp(pred))
} else {
total_mape = total_mape + mape(test$SalePrice, pred)
}
}
total_mape / k
}
We load the “train.csv” file and observe the available variables.
house_prices_full = read.csv("train.csv")
#str(house_prices_full)
Note that there is also a “test.csv” file, but it does not contain the SalePrice response variable. We split the data from “train.csv” randomly into a 75/25 train/test split as house_prices and house_prices_test. We will train our models on house_prices and evaluate our final model in terms of how well it works to make predictions on house_prices_test.
set.seed(08032019)
house_prices_idx = sample(nrow(house_prices_full), size = trunc(0.75 * nrow(house_prices_full)))
house_prices = house_prices_full[house_prices_idx, ]
house_prices_test = house_prices_full[-house_prices_idx, ]
Based on our above real-world assumptions and review of realtor websites, we make the following initial variable selection:
We first build a large additive model that includes all of the above predictors.
house_prices.lm = lm(SalePrice ~ GrLivArea + LotArea + GarageArea + BedroomAbvGr + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + HouseStyle + CentralAir + Heating, data = house_prices)
We then test this model using a training data set and a k-fold cross validation.
(mape1 = kfold_cv(house_prices, house_prices.lm, 5))
## [1] 13.42263
The reported error for this initial model is 13.4226254.
Let’s look at the Fitted Vs Residuals Plot and the Normal Q-Q Plot.
diagnostics(house_prices.lm)
We can see that the constant variance assumption seems to be violated, since the variation increases as you move to the right, The fitted vs residuals plot also exhibits a crescent shape indicating non-linearly. It seems clear that the there is an underlying pattern to the variance, indicating that a transformation might be applicable. Since the variation seems to grow exponentially as it moves to the right, we suspect that a log transformation of the reponse may be effective.
We fit a model using the log of the response (sale price), and assess the diagnostic plots for the log-response model.
house_prices.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + BedroomAbvGr + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + HouseStyle + CentralAir + Heating, data = house_prices)
diagnostics(house_prices.lm)
It appears that a log transformation of the response did help make the distribution of error more equal and normal than before.
We now check this new model’s MAPE for accuracy:
(mape2 = kfold_cv(house_prices, house_prices.lm, 5, log_response = TRUE))
## [1] 11.94528
We see that this model’s error of 11.9452845% has improved slightly relative to the original (non-log-transformed) version.
We check the model for the significance of the predictors by doing a t-test of each parameter, using an \(\alpha = 0.10\) for significance.
summary(house_prices.lm)
##
## Call:
## lm(formula = log(SalePrice) ~ GrLivArea + LotArea + GarageArea +
## BedroomAbvGr + FullBath + HalfBath + TotalBsmtSF + OverallQual +
## OverallCond + YearBuilt + BldgType + HouseStyle + CentralAir +
## Heating, data = house_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.13813 -0.07149 0.00254 0.07319 0.49384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.530e+00 6.011e-01 5.873 5.71e-09 ***
## GrLivArea 2.514e-04 2.159e-05 11.647 < 2e-16 ***
## LotArea 2.228e-06 6.170e-07 3.611 0.000319 ***
## GarageArea 2.144e-04 3.115e-05 6.882 1.00e-11 ***
## BedroomAbvGr -3.184e-03 8.217e-03 -0.388 0.698451
## FullBath 4.330e-02 1.437e-02 3.013 0.002651 **
## HalfBath 3.570e-02 1.419e-02 2.515 0.012052 *
## TotalBsmtSF 5.205e-05 1.899e-05 2.741 0.006230 **
## OverallQual 9.496e-02 6.066e-03 15.655 < 2e-16 ***
## OverallCond 5.240e-02 5.124e-03 10.226 < 2e-16 ***
## YearBuilt 3.513e-03 3.133e-04 11.212 < 2e-16 ***
## BldgType2fmCon 1.296e-02 3.724e-02 0.348 0.727924
## BldgTypeDuplex -1.119e-01 2.891e-02 -3.871 0.000115 ***
## BldgTypeTwnhs -1.737e-01 2.934e-02 -5.921 4.32e-09 ***
## BldgTypeTwnhsE -5.714e-02 2.017e-02 -2.833 0.004704 **
## HouseStyle1.5Unf 3.929e-02 6.214e-02 0.632 0.527331
## HouseStyle1Story 4.940e-02 2.109e-02 2.343 0.019329 *
## HouseStyle2.5Fin -3.957e-02 6.629e-02 -0.597 0.550629
## HouseStyle2.5Unf 1.534e-04 5.392e-02 0.003 0.997731
## HouseStyle2Story -3.325e-02 2.106e-02 -1.579 0.114739
## HouseStyleSFoyer 4.216e-02 3.714e-02 1.135 0.256608
## HouseStyleSLvl 2.796e-02 2.792e-02 1.001 0.316951
## CentralAirY 6.352e-02 2.692e-02 2.360 0.018477 *
## HeatingGasW 3.178e-02 4.923e-02 0.646 0.518716
## HeatingGrav -7.179e-02 7.207e-02 -0.996 0.319405
## HeatingOthW -7.618e-02 1.183e-01 -0.644 0.519833
## HeatingWall 9.365e-02 8.643e-02 1.084 0.278811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1592 on 1068 degrees of freedom
## Multiple R-squared: 0.8389, Adjusted R-squared: 0.835
## F-statistic: 214 on 26 and 1068 DF, p-value: < 2.2e-16
We see that there are numerous \(\beta\)-parameters with high p-values, such as for example BedroomAbvGr, Heating, and HouseStyle. We attempt to remove these and see if our model improves.
house_prices.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + CentralAir, data = house_prices)
summary(house_prices.lm)
##
## Call:
## lm(formula = log(SalePrice) ~ GrLivArea + LotArea + GarageArea +
## FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond +
## YearBuilt + BldgType + CentralAir, data = house_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22590 -0.07210 0.00162 0.07552 0.49934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.462e+00 5.617e-01 6.165 9.96e-10 ***
## GrLivArea 2.108e-04 1.788e-05 11.792 < 2e-16 ***
## LotArea 2.288e-06 6.166e-07 3.711 0.000217 ***
## GarageArea 2.375e-04 3.044e-05 7.800 1.45e-14 ***
## FullBath 3.653e-02 1.388e-02 2.631 0.008627 **
## HalfBath 9.672e-03 1.257e-02 0.769 0.441845
## TotalBsmtSF 9.293e-05 1.536e-05 6.049 2.00e-09 ***
## OverallQual 9.220e-02 5.848e-03 15.766 < 2e-16 ***
## OverallCond 5.249e-02 5.082e-03 10.328 < 2e-16 ***
## YearBuilt 3.573e-03 2.913e-04 12.267 < 2e-16 ***
## BldgType2fmCon -3.680e-03 3.680e-02 -0.100 0.920355
## BldgTypeDuplex -8.885e-02 2.718e-02 -3.269 0.001115 **
## BldgTypeTwnhs -1.903e-01 2.841e-02 -6.699 3.37e-11 ***
## BldgTypeTwnhsE -5.644e-02 1.923e-02 -2.935 0.003411 **
## CentralAirY 6.401e-02 2.421e-02 2.643 0.008325 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1599 on 1080 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8336
## F-statistic: 392.4 on 14 and 1080 DF, p-value: < 2.2e-16
Now all of our \(\beta\)-parameters, apart from the dummy variable for BldgType2fmCon seem to be significant. Because the other dummy variables seem to be significant, we’ll leave BldgType in.
Let’s look at the Fitted Vs Residuals and the Q-Q Plot for this new model
diagnostics(house_prices.lm)
The variance looks consistent from left to right. When looking at the Q-Q Plot we can see that there seems to be some significant deviation from the line towards the left side of the graph.
We also check cross-validated MAPE:
(mape3 = kfold_cv(house_prices, house_prices.lm, 5, log_response = TRUE))
## [1] 11.69764
We see that the performance of our regression continues to improve, with a cross-validated MAPE of 11.6976414.
We still have an issue with our model diagnostics, as the Normal Q-Q plot still shows substantial deviation from normal for our residuals.
OUr log response transformation has not worked to correct the problem and so we now turn to outliers.
We examune a plot showing leverage, standardized residuals, and Cook’s distance (red dotted lines).
plot(house_prices.lm, which = 5)
There is one quite extreme outlier in observation #1299. Looking at the underlying data, we see that the data quality of this point is highly questionable. We conclude that this point can be safely removed from the data set.
influential_indices = as.vector(which(cooks.distance(house_prices.lm) > 4/length(cooks.distance(house_prices.lm))))
house_prices_outliers_removed.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + CentralAir, data = house_prices[-influential_indices[1], ])
diagnostics(house_prices_outliers_removed.lm)
Removing this one observation did improve the Normal Q-Q plot, but not enough to convince us that our residuals follow a normal distribution.
We check the new model accuracy:
(mape4 = kfold_cv(house_prices[-influential_indices[1],], house_prices_outliers_removed.lm, 547, log_response = TRUE))
## [1] 11.82133
With the removal of a single bad observation we have improved the model performance substantially. We now have a MAPE of 11.8213331, exceeding our original goal for the project. We have a very good model for prediction performance.
Our diagnostic plots are still not entirely satisfactory. Let’s continue our analysis of outliers.
plot(house_prices_outliers_removed.lm, which = 5)
We see that there are still significant outliers in the model. We will now fit our model after removing all outliers (defined as those points have a Cook’s distance of > 4/n).
influential_indices = as.vector(which(cooks.distance(house_prices.lm) > 4/length(cooks.distance(house_prices.lm))))
house_prices_outliers_removed.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + CentralAir, data = house_prices[-influential_indices, ])
diagnostics(house_prices_outliers_removed.lm)
The results are satisfying, at least visually. We have plots that give some confidence of having equal variance and normal distribution of error.
bptest(house_prices_outliers_removed.lm)
##
## studentized Breusch-Pagan test
##
## data: house_prices_outliers_removed.lm
## BP = 73.959, df = 14, p-value = 3.671e-10
shapiro.test(resid(house_prices_outliers_removed.lm))
##
## Shapiro-Wilk normality test
##
## data: resid(house_prices_outliers_removed.lm)
## W = 0.99161, p-value = 1.286e-05
While we would still have to reject the null hypothesis with these two tests and conclude that it is unlikely that our error distribution is equal and normal, the performance is better than the unmodified original model.
We now look at how this model based on removing outliers performs using k-fold cross validation:
length(coef(house_prices_outliers_removed.lm))
## [1] 15
(mape5 = kfold_cv(house_prices[-influential_indices, ], house_prices_outliers_removed.lm, 6, log_response = TRUE))
## Warning in split.default(sample(1:nrow(data)), 1:k): data length is not a
## multiple of split variable
## [1] 8.362371
Our k-fold cross-validation shows that performance on this model is very good, 8.362371%.
summary(house_prices_outliers_removed.lm)
##
## Call:
## lm(formula = log(SalePrice) ~ GrLivArea + LotArea + GarageArea +
## FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond +
## YearBuilt + BldgType + CentralAir, data = house_prices[-influential_indices,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49994 -0.06512 -0.00222 0.06288 0.34305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.011e+00 3.946e-01 7.631 5.34e-14 ***
## GrLivArea 2.725e-04 1.291e-05 21.110 < 2e-16 ***
## LotArea 3.283e-06 5.605e-07 5.858 6.32e-09 ***
## GarageArea 2.207e-04 2.205e-05 10.010 < 2e-16 ***
## FullBath 3.749e-03 1.012e-02 0.370 0.71115
## HalfBath 1.635e-03 8.697e-03 0.188 0.85088
## TotalBsmtSF 1.686e-04 1.136e-05 14.845 < 2e-16 ***
## OverallQual 7.691e-02 4.124e-03 18.651 < 2e-16 ***
## OverallCond 5.328e-02 3.629e-03 14.682 < 2e-16 ***
## YearBuilt 3.797e-03 2.050e-04 18.527 < 2e-16 ***
## BldgType2fmCon -3.330e-02 3.353e-02 -0.993 0.32100
## BldgTypeDuplex -1.141e-01 2.130e-02 -5.355 1.06e-07 ***
## BldgTypeTwnhs -1.575e-01 2.020e-02 -7.797 1.56e-14 ***
## BldgTypeTwnhsE -3.866e-02 1.335e-02 -2.895 0.00387 **
## CentralAirY 4.515e-02 1.847e-02 2.444 0.01468 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1075 on 1019 degrees of freedom
## Multiple R-squared: 0.9161, Adjusted R-squared: 0.9149
## F-statistic: 794.2 on 14 and 1019 DF, p-value: < 2.2e-16
We are satisfied with model prediction performance, and so we check to see if the model can be simplified further using automated techniques like stepwise AIC and BIC automated model selection.
We start with our model and try backwards AIC.
house_prices_outliers_removed.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + BedroomAbvGr + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + HouseStyle + CentralAir + Heating, data = house_prices[-influential_indices,])
house_prices_outliers_removed.lm = step(house_prices_outliers_removed.lm, direction = "backward", trace = 0)
summary(house_prices_outliers_removed.lm)
##
## Call:
## lm(formula = log(SalePrice) ~ GrLivArea + LotArea + GarageArea +
## BedroomAbvGr + TotalBsmtSF + OverallQual + OverallCond +
## YearBuilt + BldgType + CentralAir + Heating, data = house_prices[-influential_indices,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51584 -0.06309 -0.00132 0.06324 0.33611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.900e+00 3.503e-01 8.280 3.86e-16 ***
## GrLivArea 2.981e-04 1.129e-05 26.390 < 2e-16 ***
## LotArea 3.254e-06 5.554e-07 5.858 6.32e-09 ***
## GarageArea 2.043e-04 2.221e-05 9.201 < 2e-16 ***
## BedroomAbvGr -2.042e-02 5.783e-03 -3.532 0.000431 ***
## TotalBsmtSF 1.653e-04 1.045e-05 15.815 < 2e-16 ***
## OverallQual 7.545e-02 4.131e-03 18.265 < 2e-16 ***
## OverallCond 5.319e-02 3.616e-03 14.711 < 2e-16 ***
## YearBuilt 3.869e-03 1.812e-04 21.353 < 2e-16 ***
## BldgType2fmCon -1.789e-02 3.364e-02 -0.532 0.595095
## BldgTypeDuplex -1.177e-01 2.145e-02 -5.487 5.15e-08 ***
## BldgTypeTwnhs -1.671e-01 2.015e-02 -8.290 3.57e-16 ***
## BldgTypeTwnhsE -5.342e-02 1.389e-02 -3.847 0.000127 ***
## CentralAirY 6.392e-02 2.028e-02 3.152 0.001671 **
## HeatingGasW 3.310e-02 3.432e-02 0.965 0.334947
## HeatingGrav 3.826e-04 5.167e-02 0.007 0.994094
## HeatingOthW -5.043e-02 7.861e-02 -0.641 0.521349
## HeatingWall 1.780e-01 6.587e-02 2.703 0.006983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1065 on 1016 degrees of freedom
## Multiple R-squared: 0.9178, Adjusted R-squared: 0.9164
## F-statistic: 667.3 on 17 and 1016 DF, p-value: < 2.2e-16
length(coef(house_prices_outliers_removed.lm))
## [1] 18
(mape6 = kfold_cv(house_prices[-influential_indices,], house_prices_outliers_removed.lm, 6, log_response = TRUE))
## Warning in split.default(sample(1:nrow(data)), 1:k): data length is not a
## multiple of split variable
## [1] 8.291547
We see that AIC performance is marginally better (8.291547) with one fewer predictor (full bath).
Similarly, we try backwards BIC.
house_prices_outliers_removed.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + BedroomAbvGr + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + HouseStyle + CentralAir + Heating, data = house_prices[-influential_indices,])
house_prices_outliers_removed.lm = step(house_prices_outliers_removed.lm, direction = "backward", k = log(nrow(house_prices[-influential_indices,])), trace = 0)
summary(house_prices_outliers_removed.lm)
##
## Call:
## lm(formula = log(SalePrice) ~ GrLivArea + LotArea + GarageArea +
## BedroomAbvGr + TotalBsmtSF + OverallQual + OverallCond +
## YearBuilt + BldgType + CentralAir, data = house_prices[-influential_indices,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51748 -0.06283 -0.00139 0.06294 0.33639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.895e+00 3.448e-01 8.398 < 2e-16 ***
## GrLivArea 2.990e-04 1.130e-05 26.445 < 2e-16 ***
## LotArea 3.275e-06 5.563e-07 5.886 5.36e-09 ***
## GarageArea 2.078e-04 2.214e-05 9.389 < 2e-16 ***
## BedroomAbvGr -2.076e-02 5.773e-03 -3.596 0.000339 ***
## TotalBsmtSF 1.628e-04 1.041e-05 15.630 < 2e-16 ***
## OverallQual 7.483e-02 4.118e-03 18.174 < 2e-16 ***
## OverallCond 5.281e-02 3.605e-03 14.648 < 2e-16 ***
## YearBuilt 3.881e-03 1.787e-04 21.718 < 2e-16 ***
## BldgType2fmCon -2.294e-02 3.340e-02 -0.687 0.492399
## BldgTypeDuplex -1.084e-01 2.114e-02 -5.126 3.53e-07 ***
## BldgTypeTwnhs -1.674e-01 2.018e-02 -8.296 3.38e-16 ***
## BldgTypeTwnhsE -5.357e-02 1.389e-02 -3.855 0.000123 ***
## CentralAirY 5.174e-02 1.832e-02 2.825 0.004822 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1068 on 1020 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.916
## F-statistic: 867.9 on 13 and 1020 DF, p-value: < 2.2e-16
length(coef(house_prices_outliers_removed.lm))
## [1] 14
(mape7 = kfold_cv(house_prices[-influential_indices,], house_prices_outliers_removed.lm, 6, log_response = TRUE))
## Warning in split.default(sample(1:nrow(data)), 1:k): data length is not a
## multiple of split variable
## [1] 8.296839
We see that the resulting model performs well (8.2968394 - only fractionally better than our originally proposed model), and uses one less predictor (full bath) than our original model.
This is identical to the AIC-based model, above.
The exclusion of “full bath” from our new model is not suprising when you observe the t-tests for full bath in the preceding sections. This does suggest that the variable is insignificant. We look at a plot of correlation to see if it brings insight into the data.
X_numeric = na.omit(house_prices[sapply(house_prices, is.numeric)])
correlation = round(cor(X_numeric, use = "everything"), 4)
Now plot the correlations:
corrplot(correlation, type = "upper", tl.pos = "td",
method = "circle", tl.cex = 0.5, tl.col = 'black',
diag = FALSE)
Indeed, we see here that the number of full baths is strongly positively correlated with with general living area and overall quality of the home. Since both of those variables are already in our model, it is not surprising that full bath is somewhat superfluous.
We have selected a final model and validated its performance and adherence to LINE assumptions. We finally demonstrate the performance of the model and its diagnostics.
Let’s see how this model performs for making predictions against our test data.
influential_indices = as.vector(which(cooks.distance(house_prices.lm) > 4/length(cooks.distance(house_prices.lm))))
house_prices_outliers_removed.lm = lm(log(SalePrice) ~ GrLivArea + LotArea + GarageArea + FullBath + HalfBath + TotalBsmtSF + OverallQual + OverallCond + YearBuilt + BldgType + CentralAir, data = house_prices[-influential_indices, ])
house_prices_prediction = exp(predict(house_prices_outliers_removed.lm,
newdata = house_prices_test))
mape(actual = house_prices_test$SalePrice, house_prices_prediction)
## [1] 10.60843
A 10.61% error falls within our target goal interval of 15% and ideally under 10%. We did not remove outliers from the test data and so it is not surprising that performance is slightly worse than we we saw during cross-validated testing of our training set.
One of our other goal was for the the model to adhere to LINE assumptions. We can see in the Fitted Vs Residuals and Q-Q Plot that the model demonstrates both normality and relatively constant variance.
diagnostics(house_prices_outliers_removed.lm)
Adjusted \(R^2\) value is:
get_adj_r2(house_prices_outliers_removed.lm)
## [1] 0.9148983
The model is excellent from this standpoint, with over 92% of the variance in responses being accounted for by our regression.
Finally, we wanted to ensure that our model was stable and did not contain collinearity.
We calculate the Variance Inflation Factor for our parameters.
vif(house_prices_outliers_removed.lm)
## GrLivArea LotArea GarageArea FullBath HalfBath
## 3.720711 1.247962 1.767703 2.709619 1.732507
## TotalBsmtSF OverallQual OverallCond YearBuilt BldgType2fmCon
## 1.895945 2.705943 1.377567 3.138659 1.059470
## BldgTypeDuplex BldgTypeTwnhs BldgTypeTwnhsE CentralAirY
## 1.070334 1.128769 1.191084 1.325499
We can see that all of vif’s are less than 5, indicating that multicollinearity should not be an issue with the model.
Due to the overwhelming number of variables in this data set, we felt it was best to select a probable set of variables based on domain insight gleaned from study of real estate websites. We used statistical analysis to confirm our choices and to tune the model appropriately. The reasoning being this approach is simple: the things that a buyer values in a home are likely propotional to the cost. Real estate offices and websites have likely narrowed down the things that are important to buyers so that the buyers can find a home easily and make a purchase through them.
Since all of the houses in this data set came from the same geographic area (Ames, Iowa), we can assume that the physical characteristics of the house determined its price, as opposed to the comparison of prices being driven by the housing market in which it resides (i.e. comparing prices in completely different markets, such as, South Dakota and southern California).
Our approach was incremental, starting with a proposed model and then using cross-validation and model diagnostics to continually refine the model and improve its characteristics.
As we could see from the small average percent error and high \(R^2\) of our model, the characteristics most prominently featured on a real estate website were highly correlated with the price of the home.
In conclusion, the model performs well across multiple metrics, adhere adequately to LINE assumptions, and makes quite accurate predictions. This gives evidence to support the hypothesis that a good model can be made from the most prominent features displayed on real estate websites.